INTRODUCTION

I take the Wine Quality Red dataset and apply exploratory data analysis techniques to explore relationships from one variable to multiple variables, and to examine the data for distributions, outliers, and anomalies. The analysis is done in RStudio with the help of packages such as ggplot2, knitr, dplyr, alr3, and gridExtra.

Univariate Plots Section

Our dataset consists of 13 variables and 1599 observations. The quality variable is discrete; the others are continuous.

   #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #   3.000   5.000   6.000   5.636   6.000   8.000

Red wine quality is approximately normally distributed, with most wines rated 5 or 6.

The distribution of fixed acidity is right skewed and concentrated around 7.9.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

It is unclear from the plot whether the distribution of volatile acidity is bimodal or unimodal, and whether it is right skewed or normal.

   #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric acid is not normal.

   #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual sugar is right skewed, and concentrated around 2. There are a few outliers in the plot.
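
The report does not say which rule was used to call points outliers. As a rough check, and assuming Tukey's 1.5 × IQR rule (an assumption, not something stated in this report), the quartiles printed above give the following fences for residual sugar. The helper name is_outlier is made up for illustration.

```r
# Quartiles and maximum of residual sugar, copied from the summary output above
q1 <- 1.9
q3 <- 2.6
min_value <- 0.9
max_value <- 15.5

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # Tukey's lower fence (assumed rule)
upper_fence <- q3 + 1.5 * iqr   # Tukey's upper fence (assumed rule)

# Hypothetical helper: TRUE for values that fall outside the fences
is_outlier <- function(x, lower, upper) x < lower | x > upper

upper_fence
is_outlier(c(min_value, max_value), lower_fence, upper_fence)
```

With these quartiles the upper fence works out to 3.65, so the maximum of 15.5 lies far outside it, while the minimum of 0.9 stays inside the lower fence.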

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   # 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The distribution of chlorides is roughly normal around 0.08, with a long right tail: the maximum of 0.611 is far above the third quartile of 0.09, so the plot has some outliers.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free sulfur dioxide is right skewed and concentrated around 14.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #    6.00   22.00   38.00   46.47   62.00  289.00

The distribution of total sulfur dioxide is right skewed and concentrated around 38. There are a few outliers in the plot.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The distribution of density is normal and concentrated around 0.9967.

   #   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #   2.740   3.210   3.310   3.311   3.400   4.010

The distribution of pH is normal and concentrated around 3.310.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is right skewed and concentrated around 0.6581. The plot has some outliers.

   #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   #    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of alcohol is right skewed and concentrated around 10.20.
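
The skewness calls above were read off the plots. One crude numeric cross-check, sketched here under the assumption that a mean well above the median signals a right skew, is to scale the gap between mean and median by the interquartile range. The numbers are copied from the summary outputs above; the column name gap is made up for illustration.

```r
# Quartiles, median, and mean of each variable, copied from the summaries above
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(1.900, 0.07000,  7.00, 22.00, 0.9956, 3.210, 0.5500,  9.50),
  median = c(2.200, 0.07900, 14.00, 38.00, 0.9968, 3.310, 0.6200, 10.20),
  mean   = c(2.539, 0.08747, 15.87, 46.47, 0.9967, 3.311, 0.6581, 10.42),
  q3     = c(2.600, 0.09000, 21.00, 62.00, 0.9978, 3.400, 0.7300, 11.10)
)

# Crude skew indicator (an assumption, not a formal skewness statistic):
# positive values mean the mean sits above the median
stats$gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Largest gaps first
stats[order(-stats$gap), c("variable", "gap")]
```

On these numbers density and pH come out close to zero, while residual sugar and chlorides show the largest positive gaps, which suggests their means are pulled upward by large values.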

We divide the data into three groups: the high quality group contains wines rated 7 or 8, the average
quality group contains wines rated 5 or 6, and the low quality group contains wines rated 3 or 4.
After examining how each feature differs among the three groups, we see that volatile acidity,
density, and citric acid may have some correlation with quality.
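
A comparison of features across the three groups could look roughly like the sketch below. The data frame holds a handful of made-up rows standing in for the real wine data, so the printed means are illustrative only; the object names wines and group_means are assumptions.

```r
# Toy data with MADE-UP values; the real analysis would use the wine data frame
wines <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  volatile.acidity = c(0.90, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35),
  citric.acid      = c(0.05, 0.10, 0.20, 0.25, 0.28, 0.30, 0.40, 0.45)
)

# Label each wine with its quality group (3-4 low, 5-6 average, 7-8 high)
wines$group <- ifelse(wines$quality <= 4, "low",
                      ifelse(wines$quality <= 6, "average", "high"))

# Mean of each feature within each group
group_means <- aggregate(cbind(volatile.acidity, citric.acid) ~ group,
                         data = wines, FUN = mean)
group_means
```

Features whose group means move steadily from low to high quality are the ones worth a closer look in the bivariate plots.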

Univariate Analysis

Structure of the dataset

There are 1,599 red wines in the dataset with 11 features on the chemical properties of the wine
(fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide,
total.sulfur.dioxide, density, pH, sulphates, and alcohol), plus a quality rating (quality).

Main feature(s) of interest in the dataset

The main features in the data set are pH and quality. I’d like to determine which features are best for
predicting the quality of a wine. I suspect pH and some combination of the other variables can be used to
build a predictive model to grade the quality of wines.

Other features of interest in the dataset

fixed.acidity, sulphates, citric.acid, and alcohol likely contribute to the quality of a wine.

Create new variable from existing variables in the dataset

I created a new variable called “quality.cat”, a categorical version of quality with the levels “low”,
“average”, and “high”.
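
One way to build such a variable is base R's cut(). This is a minimal sketch, assuming the same boundaries as the three groups described earlier (3 to 4 low, 5 to 6 average, 7 to 8 high); the exact code used for this report is not shown here, and the short quality vector is only a stand-in for the real column.

```r
# Stand-in for the quality column: one wine at each observed score
quality <- c(3, 4, 5, 6, 7, 8)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8]
quality.cat <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "average", "high"),
  ordered_result = TRUE   # keeps the natural order low < average < high
)

table(quality.cat)
```

Making the factor ordered keeps "low", "average", and "high" in a sensible order on plot axes and legends.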

Bivariate Plots Section

A weak negative correlation of -0.2 exists between percent alcohol content and volatile.acidity.

The correlation coefficient is 0.04, which indicates that there is almost no relationship between residual
sugar and percent alcohol content.
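
Coefficients like the ones quoted above can be computed with cor(), and cor.test() adds a significance test. The sketch below uses a few made-up values in place of the real columns, so its result will not match the figures in this report.

```r
# MADE-UP values standing in for two columns of the wine data
alcohol          <- c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8)
volatile.acidity <- c(0.70, 0.88, 0.52, 0.60, 0.45, 0.58, 0.36)

# Pearson correlation coefficient
r <- cor(alcohol, volatile.acidity, method = "pearson")

# Same coefficient with a test of the null hypothesis of zero correlation
test <- cor.test(alcohol, volatile.acidity)

round(r, 2)
test$p.value
```

A coefficient near 0 (such as the 0.04 for residual sugar) indicates almost no linear relationship, while the sign shows the direction of the relationship.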

There is a negative correlation between citric acid and volatile acidity.

The correlation coefficient between alcohol and density is -0.5, so the relationship is quite clear: as percent alcohol content increases, the density decreases. The reason is simple: alcohol (ethanol) is less dense than water, so a higher alcohol content lowers the density of the wine.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s)
of interest vary with other features in the dataset?
I observed a negative relationship between quality.cat and volatile acidity, and a positive relationship
between quality.cat and alcohol. The correlation coefficient of quality.cat and citric.acid is 0.226; the
graph shows a weak positive relationship between them. Alcohol and volatile acidity don’t have any clear
relationship with each other.

Did you observe any interesting relationships between the other features (not the main feature(s) of
interest)?
Yes, I observed a positive relationship between density and fixed acidity, a positive relationship between fixed acidity and citric acid, and a negative relationship between pH and fixed acidity.

What was the strongest relationship you found?
Quality is positively and strongly correlated with alcohol.

Multivariate Plots Section

The plot reveals a clear pattern: most of the green dots (high-quality wine) lie where both alcohol and
sulphates levels are high. There is also a visible band of blue dots in the middle of the plot, which
implies that this combination of variables helps distinguish between the different levels of medium-quality wine (5 and 6).

The plot reveals some patterns in the data. The majority of green dots are concentrated in the upper
part, while the majority of blue dots are concentrated in the bottom part of the plot. Thus, this
combination of variables may be useful for distinguishing medium-quality wine from high-quality wine.

Within the wine quality.cat levels, we see a positive relationship between fixed acidity and
citric acid.
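
To check whether a relationship like this holds inside each quality level rather than only overall, the correlation can be computed per group. The sketch below uses made-up rows and assumed object names (wines, r_by_level); it illustrates the approach and does not reproduce the report's results.

```r
# Toy data with MADE-UP values, four wines per quality level
wines <- data.frame(
  quality.cat   = rep(c("low", "average", "high"), each = 4),
  fixed.acidity = c(6.8, 7.4, 8.1, 9.0, 7.0, 7.8, 8.6, 9.9, 7.2, 8.0, 9.1, 10.4),
  citric.acid   = c(0.02, 0.10, 0.18, 0.30, 0.05, 0.20, 0.31, 0.45,
                    0.22, 0.33, 0.41, 0.56)
)

# Split the rows by quality level, then compute one correlation per level
by_level <- split(wines, wines$quality.cat)
r_by_level <- sapply(by_level, function(d) cor(d$fixed.acidity, d$citric.acid))

round(r_by_level, 2)
```

If the sign of the coefficient is the same in every level, the relationship does not depend on the quality grouping.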

Multivariate Analysis

The plot of wine quality.cat against sulphates and alcohol implies that this combination of variables helps
distinguish between the different levels of medium-quality wine (5 and 6). Within the quality.cat levels, we see a positive relationship between fixed acidity and citric acid.
This combination of variables may be useful for distinguishing medium-quality wine from high-quality wine.

Talk about some of the relationships you observed in this part of the investigation. Were there features
that strengthened each other in terms of looking at your feature(s) of interest?
When looking at wine quality level, we see a positive relationship between fixed acidity and citric acid.

Final Plots and Summary

Description of plot1

Alcohol has a high correlation with wine quality. Alcohol and citric acid are the two characteristics that increase the perceived quality of wine the most. pH and volatile acidity, on the contrary, reduce the perceived quality
the most.

Description of Plot2

The plot reveals some patterns in the data. The majority of blue dots are concentrated in the middle
to lower part, while the majority of green dots are concentrated in the upper part of the plot. The plot
also indicates that average-quality wines outnumber both low-quality and high-quality wines. Alcohol and
sulphates, together with other quality-increasing characteristics, contribute the most to
predicting red wine quality.

Summary

The wines data set contains information on 1599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots.

The main features in the data set are pH and quality. fixed.acidity, sulphates, citric.acid, and alcohol likely contribute to the quality of a wine. Within the quality.cat levels, there is a positive relationship between fixed acidity and citric acid. There is a negative relationship between quality.cat and volatile acidity, and a positive relationship between quality.cat and alcohol. The correlation coefficient of quality.cat and citric.acid is 0.226; the graph shows a weak positive relationship between them. Alcohol and volatile acidity don’t have any clear relationship with each other. Among the other features, there is a positive relationship between density and fixed acidity and a negative relationship between pH and fixed acidity.

There are very few wines that are rated as low or high quality. We could improve the analysis by collecting more data and creating more variables that may contribute to the quality of wine. Having said that, we have identified features that appear to be related to the quality of red wine, visualized their relationships, and summarized their statistics.